Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
# This is a so-called "R chunk" where you can write R code.
today <- date()
paste("Today is", today, ". Hello R! Hello world!")
## [1] "Today is Thu Dec 3 19:02:54 2020 . Hello R! Hello world!"
How I found out about this course: I already came across this course last year (unfortunately I didn’t have the time to participate) and was intrigued by the contents and impressed by the fabulous feedback.
My background and expectations: As a graduate student in physics I have grown quite fond of using Python for my data analysis work. Now I am excited to see and compare what is possible with R - I am looking forward to participating in this interesting course! So far, the “warming up” week went fine and the introductions were easy to follow. The syntax shown so far didn’t look too complicated and was partly familiar from Python and Python libraries such as pandas and matplotlib. As I am quite used to Python (and it is used more and more in the natural sciences), I don’t expect to switch to R after the course, but I would love to learn more about it and get an understanding of its pros and cons. So if something comes up in the future for which R is better suited, I will be able to use it or - if collaborators use it - to understand it. And furthermore, it’s always fun to learn new skills. :)
Link to my GitHub repository : https://github.com/kirstef/IODS-project
Looking forward to next week!
Describe the work you have done this week and summarize your learning.
In this chapter we will explore the dataset from http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt, first taking a closer look at the data frame and then visualizing some interesting features.
REMARK: The data wrangling part is done in the .R file in the data folder. Still, I will show a preview of the original dataset in the diary, as it illustrates the development from the original data to the dataset we finally evaluate.
First let’s read in the full learning2014 data as a dataframe from the given website.
lrn14 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header=TRUE)
head(lrn14)
## Aa Ab Ac Ad Ae Af ST01 SU02 D03 ST04 SU05 D06 D07 SU08 ST09 SU10 D11 ST12
## 1 3 1 2 1 1 1 4 2 4 4 2 4 4 3 3 2 3 3
## 2 2 2 2 2 1 1 4 2 4 4 4 2 3 4 4 1 4 1
## 3 4 1 1 1 1 1 3 1 4 4 2 3 4 1 3 1 4 4
## 4 4 2 3 2 1 1 3 3 4 4 3 4 4 2 3 1 3 3
## 5 3 2 2 1 2 1 4 2 5 3 4 4 4 3 4 2 4 2
## 6 4 2 1 1 1 1 4 3 5 4 3 5 5 4 4 1 5 3
## SU13 D14 D15 SU16 ST17 SU18 D19 ST20 SU21 D22 D23 SU24 ST25 SU26 D27 ST28
## 1 3 4 3 2 3 2 4 2 3 3 2 2 4 4 4 4
## 2 3 2 3 4 4 2 3 1 2 2 3 4 2 4 2 2
## 3 2 4 2 3 3 1 4 3 2 4 3 3 4 4 3 5
## 4 2 4 3 2 3 1 3 3 3 3 3 2 3 2 3 3
## 5 3 4 3 3 4 1 4 3 2 3 3 4 4 3 3 5
## 6 1 5 4 2 3 2 4 3 4 5 4 2 4 2 5 4
## SU29 D30 D31 SU32 Ca Cb Cc Cd Ce Cf Cg Ch Da Db Dc Dd De Df Dg Dh Di Dj Age
## 1 3 4 4 3 2 4 3 4 3 2 3 4 3 4 4 5 4 2 4 3 4 4 53
## 2 3 3 4 5 4 4 4 5 5 3 2 4 4 3 3 4 3 2 3 3 2 4 55
## 3 2 4 3 5 3 5 4 4 3 4 4 2 1 4 4 1 4 1 3 1 1 5 49
## 4 3 4 4 3 3 4 4 4 3 4 4 3 2 4 5 2 5 1 5 4 2 5 53
## 5 3 3 4 4 2 4 4 3 3 3 4 4 3 4 4 4 4 2 5 5 3 3 49
## 6 2 5 5 3 3 5 4 4 3 4 5 4 3 5 4 4 4 3 4 3 3 5 38
## Attitude Points gender
## 1 37 25 F
## 2 31 12 M
## 3 25 24 F
## 4 35 10 M
## 5 37 22 M
## 6 38 21 F
Even just looking at the head of the data frame makes it clear that some data wrangling will help in understanding the data better. It also shows how valuable metadata is: by looking at the variables alone, we cannot tell what they mean.
For this we have to include the library dplyr (library(dplyr)), a common R library used for data wrangling.
library(dplyr)
First let’s use read.csv() or read.table() to read in the data subset created before and saved as a .csv in the project data-subfolder.
students2014 <- read.csv("./data/learning2014.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
With head(students2014) we can display the first six rows of the subset. It is now much easier to read and interpret than before.
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
Structure and dimension of the subset
With dim() and str() we can further explore the dimension and structure of the dataset.
dim(students2014)
## [1] 183 7
str(students2014)
## 'data.frame': 183 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim() already tells us that the subset consists of 183 observations of 7 variables. str() gives further information: the variables are those that we defined before (cf. the R script) and comprise 4 combined numeric variables giving the students’ scores (on a 0-5 scale) in the respective categories, 2 integer variables giving the age and the exam points, and one factor variable stating the gender.
Getting rid of 0 values
We can filter our subset further, e.g. to exclude the students who didn’t participate in the final exam, and afterwards look at the dimensions again.
# select rows where points are greater than zero
students2014 <- filter(students2014, points > 0)
dim(students2014)
## [1] 166 7
As one can see, we have \(183-166=17\) rows with 0 points, i.e. students who did not take the final exam. Let’s keep them excluded for the further exploration of the dataset, as the reasons for not taking the exam are unknown to us and thus don’t necessarily relate strongly to the attitude or the other variables.
Using ggpairs gives us a good graphical overview of the data and the correlations between the different variables. Using col = gender in the mapping argument lets us also see gender differences (color-coded: female = red, male = blue).
library(ggplot2)
library(GGally)
# create a plot matrix with ggpairs()
ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
Studying the plot in more detail, we see the following parts: scatter plots between the different variables, distributions of the individual variables, and correlation values between the different variables. The higher the absolute value of the correlation, the stronger the relationship between the respective variables. As we can see, the highest correlation seems to be between points and attitude. Looking at the scatter plot between these two variables, we can also see that the data points are roughly scattered along a line, which also hints at a relation between them. The attitude distribution shows a visible gender difference: whereas the distribution for the female students rises to about 2.5 and then stays at almost a “plateau” (until 3.5), the distribution for the male students shows only a small fraction of students scoring 2 or less, followed by a steep rise to a peak at a bit over 3. The surface question distribution (“surf”) peaks at about 3 for the female students, higher than for the male students. For the age one can see (as expected) that most of the students are between 20 and 30 years old.
summary(students2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
The summary() command again gives us an overview of the statistics of the different variables. We have about twice as many female students as male students, the mean age is between 25 and 26 years, and the mean score for the question types (deep, strategic, surface) lies between 2.8 and 3.7, with deep questions scoring highest. The mean attitude lies at 3.1 and the mean points at about 23. Apart from these mean values, we get further information: the minimum and maximum values, the 1st and 3rd quartiles, and the median.
From the pairs plot above, let’s choose points as our target variable and the 3 variables with the highest correlation to points as explanatory variables for the regression model: exam points ~ x1 + x2 + x3. The respective variables are attitude (corr: 0.437), stra (corr: 0.146) and surf (corr: -0.140).
summary(my_model) gives us further information on the validity and significance of our model.
# create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
To find out more about the significance of the coefficients we can look at the t value and Pr(>|t|). The t value gets larger (in absolute value) the less compatible the data are with the null hypothesis, which in this case states that there is no relation between the variable and the target value. The t value on its own is difficult to interpret, but Pr(>|t|) gives us the probability of obtaining a t value at least this extreme if the null hypothesis were true. As one can see, this probability is very low for attitude, which is also indicated by the *** next to the value (described as significance codes in the legend). Thus, a relation between points and attitude is quite clear. Both stra and surf show no significant relation to the target variable, so I will remove them from the model.
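As a sanity check, each t value in the coefficient table is simply the estimate divided by its standard error, and Pr(>|t|) follows from the t distribution with the model’s residual degrees of freedom (my_model refers to the model fitted above):

```r
# coefficient table of the fitted model
coefs <- summary(my_model)$coefficients

# t value = estimate / standard error (for attitude: 3.3952 / 0.5741, ~5.91)
t_attitude <- coefs["attitude", "Estimate"] / coefs["attitude", "Std. Error"]

# two-sided p-value from the t distribution with residual degrees of freedom
p_attitude <- 2 * pt(abs(t_attitude), df = my_model$df.residual, lower.tail = FALSE)
```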
Just for fun, once more the model with attitude and stra:
my_model2 <- lm(points ~ attitude + stra, data = students2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
The t value for stra is a bit higher and its Pr(>|t|) a bit smaller, but the relation is still not significant at the 5% level.
After removing the stra variable as well, the model is reduced to the following one:
my_model3 <- lm(points ~ attitude, data = students2014)
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The values now show a clear significance of the relation between points and attitude. So what does that mean? Apparently, the attitude towards the course influences the final result in the exam (points). That relation is not really surprising: if you like a course and are interested in it, you are probably studying more for it and will probably get more points more easily. The model could now be used to predict the points of a person with attitude = x.
Prediction of points for students with different attitudes
new_students <- c("Student X" = 2.5, "Student Y" = 4.8)
new_data <- data.frame(attitude = new_students)
# Predict the new students exam points based on attitude
predict(my_model3, newdata = new_data)
## Student X Student Y
## 20.45080 28.55936
Plotting the two variables against each other, we can see that the predicted points are right where they should be (which makes sense, as it is the same model). However, we can see that there is still a lot of scatter of the points above and below the regression line. Still, the tendency of the relationship is quite visible. The multiple R-squared value is a measure of how well our fitted model explains the data: the value of about 0.19 means the model explains about 19% of the variation in points. So the model is not perfect, but it might still be useful. To judge this further, one has to take a look at the residuals.
library(ggplot2)
# plotting the two variables against each other with aesthetic mapping
p1 <- ggplot(students2014, aes(x = attitude, y = points, col=gender))
# using points for visualization
p2 <- p1 + geom_point()
# add a regression line and plot
p3 <- p2 + geom_smooth(method = "lm")
p3
## `geom_smooth()` using formula 'y ~ x'
The Residuals vs. Fitted values plot shows that the errors are distributed fairly evenly above and below the zero line, which is a good sign for our model.
The Q-Q plot follows the line for most of the inner quantiles and only diverges from it in the outer quantiles.
The Residuals vs. Leverage plot can show us whether some points have too much leverage (outliers). In this case there is no visible outlier that completely changes the trend of the fit.
#par(mfrow = c(2,2)) #To put the images in the same figures. However, I prefer them a bit bigger right now.
plot(my_model3, which=c(1,2,5))
The data wrangling part is performed in the R-script create_alc.R in the data folder and the created dataset is saved as a .csv file (‘alc.csv’) in the data folder.
The dataset we want to look at is joined from two datasets obtained from the following website: https://archive.ics.uci.edu/ml/datasets/Student+Performance , where it is also further described. It covers the performance of secondary education students of two Portuguese schools in mathematics and Portuguese. The datasets contain the students’ answers to different questions concerning working and personal life, social background, circumstances, etc.
Here, in our analysis, we will in particular look at the alcohol consumption and its relation to other variables.
To start the data analysis part, we include the library dplyr (library(dplyr)) and read in the dataset.
library(dplyr)
alc <- read.csv("./data/alc.csv")
To get an overview of the dataset, let’s print out the names of the variables and its dimensions. The meaning of the variables can be found on the above-mentioned website.
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
dim(alc)
## [1] 382 36
In total, we have 382 observations and 36 variables in the joined alc dataset.
Choose 4 interesting variables from the data and formulate hypotheses about their relationship with alcohol consumption
Hypothesis 1 - sex: There is a statistical difference between the sexes: male students are more likely to have a high alcohol consumption than female students.
Hypothesis 2 - failures: The number of past class failures is related to alcohol consumption (more failures leading to more alcohol consumption).
Hypothesis 3 - famrel: Good family relationships make high alcohol consumption less likely.
Hypothesis 4 - goout: The frequency of going out with friends affects the alcohol consumption.
With gather() and ggplot() one can first get an overview bar plot of each variable.
library(tidyr); library(dplyr); library(ggplot2)
# draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
Now we can continue by looking at the values interesting for the hypotheses stated above.
Hypothesis 1: gender and alcohol consumption
alc %>% group_by(sex, high_use) %>% summarise(count=n())
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: sex [2]
## sex high_use count
## <chr> <lgl> <int>
## 1 F FALSE 156
## 2 F TRUE 42
## 3 M FALSE 112
## 4 M TRUE 72
The cross-tabulation supports Hypothesis 1: whereas 21.2% of the female students report a high alcohol consumption, 39.1% of the male students do, which is almost double. To look at this graphically, let’s draw a bar plot.
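The percentages can be computed directly from the counts; a small sketch using prop.table() on the alc data frame from above:

```r
# share of high_use within each sex; margin = 1 makes each row sum to 1
prop.table(table(sex = alc$sex, high_use = alc$high_use), margin = 1)
# F: 42 / (156 + 42) ~ 0.212, M: 72 / (112 + 72) ~ 0.391
```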
library(ggplot2)
g2 <- ggplot(alc, aes(x = high_use, col=sex))
g2 + geom_bar()
The bar plot shows quite clearly that the proportion of high use differs considerably between male and female students.
Hypothesis 2 - failures: Let’s now investigate whether one can see a clear relation between failures and alcohol consumption.
alc %>% group_by(high_use) %>% summarise(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## high_use count
## <lgl> <int>
## 1 FALSE 268
## 2 TRUE 114
alc %>% group_by(failures) %>% summarise(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## failures count
## <int> <int>
## 1 0 334
## 2 1 24
## 3 2 19
## 4 3 5
alc %>% group_by(high_use, failures) %>% summarise(count=n())
## `summarise()` regrouping output by 'high_use' (override with `.groups` argument)
## # A tibble: 8 x 3
## # Groups: high_use [2]
## high_use failures count
## <lgl> <int> <int>
## 1 FALSE 0 244
## 2 FALSE 1 12
## 3 FALSE 2 10
## 4 FALSE 3 2
## 5 TRUE 0 90
## 6 TRUE 1 12
## 7 TRUE 2 9
## 8 TRUE 3 3
alc %>% group_by(high_use, failures>0) %>% summarise(count=n())
## `summarise()` regrouping output by 'high_use' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: high_use [2]
## high_use `failures > 0` count
## <lgl> <lgl> <int>
## 1 FALSE FALSE 244
## 2 FALSE TRUE 24
## 3 TRUE FALSE 90
## 4 TRUE TRUE 24
As one can see, 5 of the 382 students have experienced 3 failures. Of these 5, three report high alcohol consumption. This sample might however be too small to judge whether a real relationship exists. Let’s create a bar and a box plot to investigate further.
g3 <- ggplot(alc, aes(x=failures, col=high_use))
g3 + geom_bar()
g4 <- ggplot(alc, aes(x = sex, y = alc_use, col=failures>0))
g4 + geom_boxplot()
Looking at the bar plot, one might at least conclude that having experienced no failure makes high alcohol consumption less likely. However, the sample with failures is much smaller than the one without, so it makes more sense to combine the numbers for failures > 0, as done above. Now, a box plot gives a nice visualisation of how having experienced at least 1 failure relates to the alcohol consumption. Again, the effect seems to be much stronger for the male students.
Hypothesis 3 - Family:
alc %>% group_by(famrel,high_use) %>% summarise(count=n())
## `summarise()` regrouping output by 'famrel' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups: famrel [5]
## famrel high_use count
## <int> <lgl> <int>
## 1 1 FALSE 6
## 2 1 TRUE 2
## 3 2 FALSE 10
## 4 2 TRUE 9
## 5 3 FALSE 39
## 6 3 TRUE 25
## 7 4 FALSE 135
## 8 4 TRUE 54
## 9 5 FALSE 78
## 10 5 TRUE 24
g5 <- ggplot(alc, aes(x = high_use, y = famrel))
g5 + geom_boxplot() + ylab("quality of family relationships (from 1 - very bad to 5 - excellent)")
g6 <- ggplot(alc, aes(x=famrel, col=high_use))
g6 + geom_bar()
Both the box plot and the bar plot seem to show at least a trend supporting the hypothesis: whereas the median quality of the family relationships is at 3.5 for students reporting a high alcohol consumption, it is better (4.5) for students without high consumption. The bar plot gives a somewhat clearer view of the individual family relationship bins: the high-use fraction is larger for famrel = 2 or 3 than for 4 and 5. For famrel = 1 the sample is probably too small to judge.
Hypothesis 4 - going out
alc %>% group_by(goout,high_use) %>% summarise(count=n())
## `summarise()` regrouping output by 'goout' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups: goout [5]
## goout high_use count
## <int> <lgl> <int>
## 1 1 FALSE 19
## 2 1 TRUE 3
## 3 2 FALSE 84
## 4 2 TRUE 16
## 5 3 FALSE 103
## 6 3 TRUE 23
## 7 4 FALSE 41
## 8 4 TRUE 40
## 9 5 FALSE 21
## 10 5 TRUE 32
g7 <- ggplot(alc, aes(x = high_use, y = goout, col=sex))
g7 + geom_boxplot() + ylab("going out with friends (from 1 [very low] to 5 [very high]")
g8 <- ggplot(alc, aes(x=goout, col=high_use))
g8 + geom_bar()
Again, the numeric cross-tabulation as well as the plots support the hypothesis: going out with friends more often (a higher value) seems to be associated with higher alcohol consumption.
# find the model with glm()
m <- glm(high_use ~ sex + failures + famrel + goout, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + failures + famrel + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6492 -0.7464 -0.5207 0.8642 2.4194
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3922 0.6415 -3.729 0.000192 ***
## sexM 0.8976 0.2530 3.548 0.000388 ***
## failures 0.3339 0.2034 1.641 0.100696
## famrel -0.3971 0.1390 -2.857 0.004281 **
## goout 0.7751 0.1217 6.367 1.93e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 391.08 on 377 degrees of freedom
## AIC: 401.08
##
## Number of Fisher Scoring iterations: 4
The model summary shows a high significance for sexM and goout, a medium significance for famrel and no significant value for failures. One could fit the model again, dropping at least the failures variable. But first let’s study the odds ratios (OR).
coef(m)
## (Intercept) sexM failures famrel goout
## -2.3922069 0.8976484 0.3338923 -0.3971296 0.7751449
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.09142769 0.02506625 0.3124735
## sexM 2.45382585 1.50205707 4.0576659
## failures 1.39639281 0.93864152 2.0963627
## famrel 0.67224693 0.51012773 0.8814055
## goout 2.17090677 1.72124671 2.7771652
The odds ratios tell us whether having a certain property (e.g. being male, going out often) is positively or negatively associated with the logical value we are looking at (here: high alcohol consumption). If the OR is > 1 the association is positive; if it is < 1 the association is negative.
The male sex variable has the strongest relationship to the binary target variable, with an odds ratio of 2.45 - meaning that the odds of a male student having high alcohol consumption are about 2.5 times higher than for a female student. “going out” is also positively associated with high alcohol consumption, whereas “family relation” has a negative association (a high quality of family relations makes low alcohol consumption more likely). These ORs thus support the above-mentioned hypotheses.
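Each odds ratio is simply the exponential of the corresponding log-odds coefficient of the logistic model m fitted above:

```r
# odds ratio for sexM = exp of its log-odds coefficient (~2.45)
exp(coef(m)["sexM"])
# interpretation: being male multiplies the odds of high_use by ~2.45,
# holding failures, famrel and goout fixed
```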
As failures didn’t have a significant relationship according to my logistic regression model, I will drop this variable, fit the model again, print out the summary and calculate the OR and confidence intervals:
m2 <- glm(high_use ~ sex + famrel + goout, data = alc, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ sex + famrel + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5836 -0.7707 -0.5316 0.8198 2.5474
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3550 0.6414 -3.672 0.000241 ***
## sexM 0.9342 0.2514 3.716 0.000202 ***
## famrel -0.4118 0.1387 -2.969 0.002989 **
## goout 0.7971 0.1214 6.565 5.2e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 393.80 on 378 degrees of freedom
## AIC: 401.8
##
## Number of Fisher Scoring iterations: 4
coef(m2)
## (Intercept) sexM famrel goout
## -2.3549669 0.9341541 -0.4117517 0.7971166
OR <- coef(m2) %>% exp
CI <- confint(m2) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.09489664 0.02602013 0.3241397
## sexM 2.54505957 1.56349803 4.1969640
## famrel 0.66248876 0.50299888 0.8679153
## goout 2.21913298 1.76086023 2.8372779
As a next step, let’s check the predictive power of the new model by adding the new columns probability and prediction to the alc dataset. probability will be calculated by using the predict() function on the model, and prediction will be defined as TRUE if the probability is higher than 50%.
probabilities <- predict(m2, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probabilities > 0.5)
To get a sense of how well the model performs, we can print the last ten rows of the columns we are interested in and look at the confusion matrix:
select(alc, sex, famrel, goout, high_use, probability, prediction) %>% tail(10)
## sex famrel goout high_use probability prediction
## 373 M 4 2 FALSE 0.18639810 FALSE
## 374 M 5 3 TRUE 0.25195331 FALSE
## 375 F 5 3 FALSE 0.11687357 FALSE
## 376 F 4 3 FALSE 0.16650200 FALSE
## 377 F 5 2 FALSE 0.05627990 FALSE
## 378 F 4 4 FALSE 0.30714360 FALSE
## 379 F 2 2 FALSE 0.17019623 FALSE
## 380 F 1 1 FALSE 0.12243164 FALSE
## 381 M 2 5 TRUE 0.85084788 TRUE
## 382 M 4 1 TRUE 0.09357856 FALSE
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 250 18
## TRUE 63 51
To graphically visualize the actual values and the predictions, we can draw a plot of ‘high use’ vs. ‘probability’.
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))
g + geom_point()
Furthermore, we can do another cross-tab of predictions vs. actual values, printing out the fractions instead and adding margins.
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65445026 0.04712042 0.70157068
## TRUE 0.16492147 0.13350785 0.29842932
## Sum 0.81937173 0.18062827 1.00000000
Accuracy of the model
The proportion of inaccurately classified individuals (the training error) can be calculated from the values in the confusion matrix: (18+63)/(250+18+63+51) ≈ 0.21. We get the same value by defining a loss function as below:
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2120419
The loss function gives us a value of about 0.21 (as also calculated before), meaning that about 21% of the predictions are wrong. This value is better than the value in the DataCamp example (about 26%), making this model slightly better.
To study the test set performance of the model one can use K-fold cross-validation. Here, we will try out 10-fold cross-validation, making use of the loss function defined above.
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K=10)
cv$delta[1]
## [1] 0.2225131
The cv.glm function from the boot library can be used for this, and K = 10 defines the number of folds used for the cross-validation. The delta attribute stores the error value. It gives an error of about 0.22, which is better than the 0.26 error of the DataCamp model. Thus it performs a bit better.
This chapter focuses on clustering and classification, and the Boston dataset will be the dataset to be explored. The dataset is included in the MASS package, so it can be loaded directly.
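Loading the package and the dataset (the masking message below is what attaching MASS prints, since both MASS and dplyr define select):

```r
library(MASS)    # provides the Boston housing dataset
data("Boston")
```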
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
To get a first impression of the dataset we are dealing with, let’s look at its dimensions (dim(Boston)) and structure (str(Boston)).
## [1] 506 14
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The dataset is made up of 506 observations of 14 variables describing housing values in the suburbs of Boston, including variables such as the per capita crime rate by town (crim), the proportion of non-retail business acres per town (indus), the nitrogen oxides concentration (nox) and the average number of rooms per dwelling (rm). Two of the variables (chas - Charles River dummy variable - and rad - index of accessibility to radial highways) are of integer type, the others numeric. Further information and the definitions of the different variables can be found at https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html.
Let’s see if a pairs() plot can show us some interesting things:
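A minimal call, assuming the Boston data frame is loaded (with all 14 variables the matrix gets dense, so one might restrict it to a few columns of interest; the column selection below is my own choice):

```r
# scatter plot matrix of a subset of the Boston variables
pairs(Boston[, c("crim", "indus", "nox", "rm", "medv")])
```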
A summary (summary(Boston)) can give us more insight:
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Neither the pairs plot nor the summary is easy to interpret at first glance. Another option is to use a correlation plot, i.e. to graphically display the correlations between the different variables.
library(corrplot)
library(tidyverse)
cor_matrix <- cor(Boston) %>% round(digits=2)
corrplot.mixed(cor_matrix, number.cex = .7)
In this plot we can nicely see which variables are most strongly correlated by looking at the colours and diameters of the circles. For example, circles with a darker shade of blue have a correlation value close to 1; comparing with the pairs plot above, the corresponding scatterplots show a linear tendency (e.g. nox and indus, denoting "nitrogen oxides concentration" and "proportion of non-retail business acres per town"). Then there are variable pairs whose relationship is visualised by a very small, nearly white circle - these are the ones with no significant relationship (e.g. rm and black, "average number of rooms per dwelling" vs. a measure of the black population in the town), which is also visible in the pairs plot. Using corrplot.mixed furthermore provides us with the numerical correlation values.
Scaling the data
To scale the data, the scale() function can be used on the Boston dataset. Below is a summary of the scaled data:
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
One can see that as a result of the scaling each variable is now distributed around a mean of 0, running thus from negative to positive values. For further investigation we also save the scaled data as a dataframe and look at the first six lines:
boston_scaled <- as.data.frame(boston_scaled) # save data in a dataframe
head(boston_scaled)
## crim zn indus chas nox rm age
## 1 -0.4193669 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
## dis rad tax ptratio black lstat medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990 0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324 1.3229375
## 4 1.076671 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708 1.1815886
## 5 1.076671 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.0254866 1.4860323
## 6 1.076671 -0.7521778 -1.1050216 0.1129203 0.4101651 -1.0422909 0.6705582
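As a quick sanity check, scale() simply subtracts each column's mean and divides by its standard deviation. A minimal sketch on a toy vector (not part of the course data):

```r
# scale() standardizes: (x - mean(x)) / sd(x), so the result
# has mean 0 and standard deviation 1.
x <- c(2, 4, 6, 8, 10)
by_hand <- (x - mean(x)) / sd(x)
by_scale <- as.vector(scale(x))

all.equal(by_hand, by_scale)  # TRUE
mean(by_scale)                # 0
```

This is why every column in the summary of the scaled data has a mean of exactly 0.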
Categorical variable for the crime rate
The per capita crime rate by town can be found in the dataset in the continuous variable crim. This continuous variable should now be changed into a categorical one, using the quantiles as break points. Let’s look at the quantiles first with summary(boston_scaled$crim) and then define our bins:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
bins <- quantile(boston_scaled$crim)
The new factor variable crime will hold the crime data, categorized by “low”, “med_low”, “med_high” and “high”.
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low","med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
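The roughly equal group sizes are exactly what quantile break points produce. A toy sketch (hypothetical vector, not the Boston data) illustrating the idea:

```r
# Quantile breaks split a continuous variable into four
# categories of (roughly) equal size.
v <- 1:100
bins <- quantile(v)  # 0%, 25%, 50%, 75%, 100%
cats <- cut(v, breaks = bins, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)  # 25 observations in each category
```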
Instead of the old crim variable, let’s now add the new crime variable to the boston_scaled dataframe and drop the old variable crim.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
Divide the data set into training and test sets
To divide the dataset into training and test sets we first need to know its length, which we get with nrow(). A common partition is 80% of the data for training and 20% for testing, which we will use here. With sample() we can randomly choose that percentage of the row indices and store them in a variable. Then we can build our train and test sets from the randomly chosen rows.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
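One property worth checking in such a split is that the two index sets are disjoint and have the expected 80/20 sizes. A seeded sketch (the split above used no seed, so its exact rows differ; the seed value here is arbitrary):

```r
# Sketch: an 80/20 split of 506 rows gives disjoint index
# sets of sizes 404 and 102 (sample() truncates n * 0.8).
set.seed(42)  # arbitrary seed, for reproducibility only
n <- 506
ind <- sample(n, size = n * 0.8)
length(ind)                      # 404
length(setdiff(1:n, ind))        # 102
any(setdiff(1:n, ind) %in% ind)  # FALSE: no overlap
```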
A table overview on the crime categories of the train and test sets gives us a first insight into the distribution amongst the different categories.
traincrime <- table(train$crime)
testcrime <- table(test$crime)
traincrime
##
## low med_low med_high high
## 102 96 98 108
testcrime
##
## low med_low med_high high
## 25 30 28 19
LDA Fit on the crime training set
Linear Discriminant Analysis can be used as a classification method to find linear combinations of the variables that separate the classes of the target variable. It is fit with the function lda(), which takes as input a formula (e.g. target ~ x1 + x2 …) and the dataset; target ~ . means that all other variables in the dataset are used as predictors.
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2376238 0.2425743 0.2673267
##
## Group means:
## zn indus chas nox rm age
## low 1.02060578 -0.9282113 -0.11793298 -0.8733936 0.4315140 -0.9230130
## med_low -0.09978273 -0.3723399 0.05576262 -0.6139579 -0.1388989 -0.3923673
## med_high -0.41548661 0.2537587 0.12941585 0.3383128 0.1105463 0.4561245
## high -0.48724019 1.0149946 -0.01714665 1.0544272 -0.3945765 0.8002135
## dis rad tax ptratio black lstat
## low 0.8410313 -0.6913766 -0.7366847 -0.44725315 0.3838047 -0.78362389
## med_low 0.3906120 -0.5344476 -0.5262050 -0.04970884 0.3154788 -0.17748597
## med_high -0.3555123 -0.3970905 -0.2764631 -0.21465541 0.1080257 -0.02492083
## high -0.8361555 1.6596029 1.5294129 0.80577843 -0.7702986 0.86810049
## medv
## low 0.54008236
## med_low 0.01523418
## med_high 0.17561537
## high -0.66702063
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09446400 0.820847651 -0.96582703
## indus 0.13712959 -0.510616770 0.13334400
## chas -0.05832506 -0.050540710 0.16767883
## nox 0.23880572 -0.570161438 -1.31665726
## rm -0.07379480 -0.150250320 -0.21298121
## age 0.30576850 -0.455826365 -0.15658342
## dis -0.06709117 -0.532099958 0.23301883
## rad 3.33897765 1.031374320 0.24620265
## tax 0.09667025 -0.102060062 0.29121988
## ptratio 0.13077399 0.066338044 -0.22293465
## black -0.13655882 -0.002331274 0.06768378
## lstat 0.15866833 -0.249789900 0.32366804
## medv 0.16726719 -0.397575572 -0.19876941
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9510 0.0377 0.0113
The printout shows us the result of the LDA fit on the train set, including the three extracted discriminant functions LD1, LD2 and LD3, with the highest proportion of trace being 0.951 for LD1. We get three discriminant functions because we have four groups (“low”, “med_low”, “med_high”, “high”) and the number of discriminants is always one less than the number of groups.
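The groups-minus-one rule can be checked on any labelled dataset; a small sketch on the built-in iris data (used here only for illustration):

```r
library(MASS)

# iris has 3 species, so lda() returns 3 - 1 = 2 discriminants.
fit <- lda(Species ~ ., data = iris)
ncol(fit$scaling)       # 2
colnames(fit$scaling)   # "LD1" "LD2"
```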
LDA (bi)plot
For the LDA biplot a function has to be created to also draw the LDA arrows for the different variables in the plot. The code for this function was taken from the DataCamp exercise, which refers to a Stack Overflow thread. Then the classes are stored in a numeric vector and plotted, together with the arrows, in an LDA (bi)plot.
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2,col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
From the plot we can see that the variable rad makes a major contribution to LD1. Looking at the definition of rad (“index of accessibility to radial highways”), it seems that good accessibility to radial highways as a predictor increases the probability of being placed into the high category.
Before going to the next step, let’s save the crime categories from the test set in correct_classes and then remove crime from the test dataset.
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
Now the lda.fit model can be used to predict the crime categories of the previously created test dataset:
lda.pred <- predict(lda.fit, newdata = test)
A cross-tabulation with the original crime values (stored in correct_classes) can give a nice overview on how well the fitting worked.
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 11 12 2 0
## med_low 3 18 9 0
## med_high 2 10 15 1
## high 0 0 1 18
Looking at the cross-tabulation one can see that for this data set the high predictions are almost perfect and the med_high classification is quite good, but for the lower categories low and med_low the model could be improved.
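The overall accuracy can be read off the cross-tabulation as the share of observations on the diagonal. A sketch using the values printed above (in a live session one would compute the table from correct_classes and lda.pred$class directly):

```r
# Confusion matrix copied from the printed cross-tabulation.
conf <- matrix(c(11, 12,  2,  0,
                  3, 18,  9,  0,
                  2, 10, 15,  1,
                  0,  0,  1, 18),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)  # 62 / 102, about 61% overall accuracy
```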
First, let’s reload the Boston dataset and scale it to standardize the data, before comparing the distances.
# load MASS and Boston
library(MASS)
data('Boston')
boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2) # save data in a dataframe
Distances
Without specifying the method as an argument to the dist() function, the Euclidean distance is calculated. The method can be changed, e.g. to method="manhattan", which will then give different results.
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
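The difference between the two metrics is easy to see on a pair of toy points (not taken from the Boston data):

```r
# Euclidean vs. Manhattan distance between (0, 0) and (3, 4):
# sqrt(3^2 + 4^2) = 5 vs. |3| + |4| = 7.
pts <- rbind(c(0, 0), c(3, 4))
as.numeric(dist(pts))                        # 5 (default: euclidean)
as.numeric(dist(pts, method = "manhattan"))  # 7
```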
K-Means clustering
Let’s do K-Means clustering on the scaled dataset, starting with 4 as a first try for the number of cluster centers. The results can be looked at with pairs.
km <-kmeans(boston_scaled2, centers = 4)
pairs(boston_scaled2, col = km$cluster)
As it is difficult to read anything in the full pairs plot, let’s divide it into four parts as an example:
pairs(boston_scaled2[1:4], col = km$cluster)
pairs(boston_scaled2[5:8], col = km$cluster)
pairs(boston_scaled2[9:11], col = km$cluster)
pairs(boston_scaled2[12:14], col = km$cluster)
I will stick to the [5:8] part of the dataset and investigate how changing the number of clusters (2, 3, 5) alters the result compared to the initial choice of 4 clusters.
km <-kmeans(boston_scaled2, centers = 2)
pairs(boston_scaled2[5:8], col = km$cluster)
km <-kmeans(boston_scaled2, centers = 3)
pairs(boston_scaled2[5:8], col = km$cluster)
km <-kmeans(boston_scaled2, centers = 5)
pairs(boston_scaled2[5:8], col = km$cluster)
It seems difficult to decide which number of clusters is best. One can use the total within-cluster sum of squares (WCSS) to help with the decision: the optimal number of clusters can be taken as the point where the total WCSS drops sharply. Let’s investigate the behaviour of the total WCSS from 1 to 10 clusters.
set.seed(123) # use a certain seed for the initial cluster centers
k_max <- 10 # set the maximum number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss}) #calculate the WCSS
qplot(x = 1:k_max, y = twcss, geom = 'line')
The total WCSS falls steeply until about 2 clusters, then more gently, and between 3 and 4 it rises and falls again. So we shouldn’t use more than 3 clusters, and 2 clusters is probably the optimal value.
Let’s rerun the K-means clustering with 2 centers on the whole dataset.
km <-kmeans(boston_scaled2, centers = 2)
pairs(boston_scaled2, col = km$cluster)
The new datasets to be looked at are about the Human Development Index and the Gender Inequality Index in different countries around the world. The data originates from the United Nations Development Programme and further information can be found here. Some preliminary data wrangling has been done on the data and a combined dataset has been constructed from it, excluding some of the variables and removing some missing values. All modifications to the dataset are documented in the “create_human.R” file (in the data folder) and the new variable names are also defined there.
Let’s start by including the new prepared dataset human and look at dimensions, summaries and a graphical overview.
library(dplyr)
human <- read.table('./data/human.csv')
human <- as.data.frame(human)
dim(human)
## [1] 155 8
head(human)
## SecEduRat LabForceRat EduYearsExp LifeExp GNI MatMortRat
## Norway 1.0072389 0.8908297 17.5 81.6 64992 4
## Australia 0.9968288 0.8189415 20.2 82.4 42261 6
## Switzerland 0.9834369 0.8251001 15.8 83.0 56431 6
## Denmark 0.9886128 0.8840361 18.7 80.2 44025 5
## Netherlands 0.9690608 0.8286119 17.9 81.6 45435 6
## Germany 0.9927835 0.8072289 16.5 80.9 43919 7
## AdoBirthRate PercParl
## Norway 7.8 39.6
## Australia 12.1 30.5
## Switzerland 1.9 28.5
## Denmark 5.1 38.0
## Netherlands 6.2 36.9
## Germany 3.8 36.9
We have 155 observations for different countries (row-names) and 8 variables, consisting of the ratio of females vs. males having at least secondary education (SecEduRat), the ratio of females vs. males in the labour force (LabForceRat), the expected years of schooling (EduYearsExp), life expectancy at birth (LifeExp), Gross National Income per capita (GNI), Maternal Mortality ratio (MatMortRat), Adolescent birth rate (AdoBirthRate) and percentage of female representatives in parliament (PercParl).
Visualizations
First we have to include some libraries to perform the visualisations.
library(GGally)
library(corrplot)
library(tidyverse)
First, with ggpairs(), let’s look at a pairplot and afterwards with corrplot() at a visualization of the correlation matrix.
ggpairs(human)
cor(human) %>% corrplot.mixed(number.cex = .7, tl.cex=0.5)
Looking at both plots, we can see that the strongest correlation (in absolute value, 0.86) is between the life expectancy at birth and the maternal mortality ratio. This doesn’t sound too surprising: a country with a high life expectancy probably has good facilities, e.g. in hospitals, which also make giving birth safer - and conversely a low life expectancy goes along with a high maternal mortality ratio. Another high correlation is between EduYearsExp and LifeExp (0.79), suggesting that in countries where life expectancy is higher, the education system is probably better as well. A further significant correlation (0.76) is between MatMortRat and AdoBirthRate. The adolescent birth rate is the birth rate per 1000 women aged 15-19 (description see here). Its high correlation with the maternal mortality ratio supports the statement on the above-mentioned WHO site that people giving birth early “are subject to higher risks or complications or even death during pregnancy and birth”.
Let’s look at these two variables as an example.
library(ggplot2)
g <- ggplot(human, aes(x = AdoBirthRate, y=MatMortRat))
g + geom_point()
The trend is quite visible.
A summary() of the data gives us once more information about the variables we are dealing with.
summary(human)
## SecEduRat LabForceRat EduYearsExp LifeExp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI MatMortRat AdoBirthRate PercParl
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
To gain some more understanding of the correlations between the different variables we want to use Principal Component Analysis (PCA) to extract the principal components of our data matrix and create a lower-dimensional representation from them. We will perform the PCA with the Singular Value Decomposition (SVD) method, which can be done in R using the prcomp() function. As a start, the PCA is performed on the non-standardized data and a biplot is drawn with the two leading principal components (PC1, PC2).
pca_human <- prcomp(human)
s <- summary(pca_human)
s # Look at the summary of the pca values
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
pca_pr <- round(100*s$importance[2, ], digits=1) #Round the values and print them as percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
biplot(pca_human, choices = 1:2,cex = c(0.6, 0.8),col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Already the warning messages tell us that we face a problem here: the arrows can’t be drawn correctly, and the variables can’t be compared without standardizing them. Most of the data is accumulated in one corner of the plot. We can see that the variability is captured almost entirely by the first principal component (100% after rounding), whereas all the other PCs are at 0. GNI likely has the highest contribution here, as it is the only arrow visible and its numerical range is far larger than that of the other variables. Clearly, a standardization of the variables is needed.
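The effect can be reproduced on simulated toy data: one column with a GNI-like scale swamps the unscaled PCA. A hedged sketch (made-up data, not the human dataset):

```r
# When one variable's scale dominates (like GNI, up to ~123000),
# unscaled PCA puts essentially all variance on PC1.
set.seed(1)
toy <- data.frame(small = rnorm(100),              # sd around 1
                  huge  = rnorm(100, sd = 10000))  # sd around 10000
summary(prcomp(toy))$importance[2, 1]                 # PC1: ~100% of variance
summary(prcomp(toy, scale. = TRUE))$importance[2, 1]  # after scaling: ~50%
```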
Let’s standardize the data now first and then look again at the importance of the different principal components.
human_std <- scale(human)
summary(human_std)
## SecEduRat LabForceRat EduYearsExp LifeExp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI MatMortRat AdoBirthRate PercParl
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
The variables are now standardized, as can be seen above. Let’s see how this affects the PCA.
pca_human_std <- prcomp(human_std)
s2 <- summary(pca_human_std)
s2 # Look at the summary of the pca values
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
pca_pr_std <- round(100*s2$importance[2, ], digits=1) #Round the values and print them as percentages of variance
pca_pr_std
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
Now the contributions of the different PCs make more sense: the first principal component captures about 53.6% of the variability, the second one 16.2%, followed by PC3 (9.6%), PC4 (7.6%), PC5 (5.5%), PC6 (3.6%), PC7 (2.6%) and PC8 (1.3%). Now let’s make a new plot, stating the importance of the first two components in the axis labels.
# Create a labeling object for the x- and y axis
pc_lab <- paste0(names(pca_pr_std), " (", pca_pr_std, "%)")
# Draw the new biplot with scaled values and labels for the axes
biplot(pca_human_std, cex = c(0.6, 0.8), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
We can now differentiate between the different arrows and make some comments on their relationships. We saw before that AdoBirthRate and MatMortRat have a significant correlation. Looking at the PCA biplot of the standardized dataset, we can see that the angle between those variables is very small, which also indicates a high correlation. Moreover, the angle between these variables and the PC1 axis is quite small and they are almost orthogonal to PC2, meaning they correlate strongly with the first principal component but not with the second. The only two variables which seem to have a major impact on PC2 are PercParl and LabForceRat. Based on the biplot drawn after the PCA, I would say that the first two principal component dimensions describe the data quite well, as the variables seem to correlate well with either the first or the second component. Looking at the standard deviations of the different components (which could be visualised in a scree plot), one can see that only PC1 and PC2 are above 1, whereas PC3 is already below 1. This also suggests that the first two components could be enough to describe the correlations. Together they account for almost 70% of the variance in the dataset.
Next, we will explore the tea dataset from the package FactoMineR. The data comes from a questionnaire on tea-drinking (e.g. where, when, what and some personal data). More information on the dataset can be found here. First, we have to import the FactoMineR package and load the tea dataset.
library(FactoMineR)
library(ggplot2)
library(dplyr)
data(tea)
dim(tea)
## [1] 300 36
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
Looking at the dimensions and structure of the dataset, we can see it is made up of 300 observations (people interviewed) and 36 variables. Using a pairs or ggpairs plot on the complete dataset would be too much. (Tried it - takes forever, and you can’t decipher anything. ;)) So let’s look directly at a subset of the data.
I will keep the same subset as was examined in the datacamp exercise, look at the summary and gather the variables in bar plots.
#library(FactoMineR); library(dplyr); library(ggplot2); library(tidyr)
# column names to keep in the new dataset tea_time
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Taking a quick glance at the data, one can see that most of the people interviewed prefer Earl Grey, drink tea pure (apart from maybe sugar), prefer teabags and don’t drink tea at lunch. About half of the interviewees take tea without sugar, the other half with. This is just a first quick view on the summary data - the kind of tea might for example have an impact on whether one takes it with milk, sugar, lemon etc. or without. So let’s continue with a Multiple Correspondence Analysis (MCA) to dig deeper into the qualitative dataset.
Multiple Correspondence Analysis (MCA)
MCA is a method for analyzing qualitative data. It can be used to find and explore patterns and (similar to PCA) to reduce the dimensions. Calling summary() on the MCA model provides statistical outputs, such as the v.test values, the importance of the dimensions and the correlations with the different variables.
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"))
plot(mca, invisible=c("ind"),habillage = "quali")
The summary gives us the coordinates, contributions (ctr) and squared cosines (cos2) for the different dimensions. In the MCA biplot, the first two dimensions (Dim1: 15.24%, Dim2: 14.23% of the variance) are shown. Using the habillage argument conveniently colours the categories according to their variable (“How”: alone/lemon/milk/other, “how”: tea bag/tea bag+unpackaged/unpackaged, “lunch”: lunch/Not.lunch, “sugar”: No.sugar/sugar, “Tea”: black/Earl Grey/green and “where”: chain store/chain store+tea shop/tea shop).
Let’s investigate further: One aspect that can be explored is the relationship between tea type and sugar. Both No.sugar and sugar are quite close to Earl Grey, but sugar is farther away from black and green tea, suggesting that more people who drink Earl Grey might add sugar than people drinking black or green tea. Another thing quite visible is how strongly the different categories are correlated with Dim1 and Dim2: other apparently has the highest correlation with Dim2 (followed by chain store+tea shop, tea bag+unpackaged and green), whereas unpackaged and tea shop show a higher correlation with Dim1, but also a considerable correlation with Dim2.
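To back the biplot reading with numbers, here is a minimal sketch. It assumes (since the chunk that built the model is not shown here) that `mca` was produced with `FactoMineR::MCA()` on the six tea-data columns listed above; the subset name `tea_time` is a hypothetical choice.

```r
# Sketch (assumption: mca was built with FactoMineR::MCA on these six columns)
library(FactoMineR)
library(dplyr)
data(tea)
tea_time <- select(tea, Tea, How, how, sugar, where, lunch)
mca <- MCA(tea_time, graph = FALSE)
head(mca$eig)           # eigenvalues and percentage of variance per dimension
round(mca$var$eta2, 3)  # squared correlations (eta2) of each variable with the dimensions
```

The `eta2` table is the numeric counterpart of the “Categorical variables (eta2)” block in the summary above.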
# Load libraries
library(ggplot2)
library(dplyr)
library(tidyr)
We have wrangled the original wide RATS dataset from here into a long version and saved it in the data folder (the data wrangling is documented in the script meet_and_repeat.R in the data folder). The dataset follows 16 rats, each belonging to one of three groups. Depending on the group membership, the rats receive a different diet, and their weights are measured repeatedly over a total timespan of 64 days.
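The core of that wide-to-long step can be sketched as follows. This is an assumed reconstruction, not the actual meet_and_repeat.R script; the object names `rats_wide` and `rats_long` are hypothetical.

```r
# Sketch of the wide-to-long conversion (assumed, not the actual script)
library(dplyr)
library(tidyr)
rats_wide <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt",
                        sep = "\t", header = TRUE)
rats_long <- rats_wide %>%
  pivot_longer(cols = starts_with("WD"), names_to = "days", values_to = "weight") %>%
  mutate(Time = as.integer(substring(days, 3))) %>%  # "WD64" -> 64
  arrange(Time)
dim(rats_long)  # 16 rats x 11 measurement days = 176 rows, 5 columns
```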
First, let’s read in the original dataset RATS and the long, further prepared version RATSL, and get an overview of what we are dealing with.
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep ="\t", header = T)
RATSL <- read.csv("./data/ratsl.csv") # Read in the long version of the RATS dataset
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
Using head() and tail() one can easily see the differences between the wide (RATS) and the long (RATSL) version of the data. Whereas in RATS each measurement day is a separate variable, in RATSL they are gathered into a single days variable with the measurement days as categories. The data table thus becomes narrower but longer (dimensions: 176 rows, 5 columns), which is a better format for the following analysis.
head(RATS)
## ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 1 1 1 240 250 255 260 262 258 266 266 265 272 278
## 2 2 1 225 230 230 232 240 240 243 244 238 247 245
## 3 3 1 245 250 250 255 262 265 267 267 264 268 269
## 4 4 1 260 255 255 265 265 268 270 272 274 273 275
## 5 5 1 255 260 255 270 270 273 274 273 276 278 280
## 6 6 1 260 265 270 275 275 277 278 278 284 279 281
head(RATSL)
## ID Group days weight Time
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
tail(RATSL)
## ID Group days weight Time
## 171 11 2 WD64 472 64
## 172 12 2 WD64 628 64
## 173 13 3 WD64 525 64
## 174 14 3 WD64 559 64
## 175 15 3 WD64 548 64
## 176 16 3 WD64 569 64
The variables of RATSL are as follows: ID (identifier of the individual rat), Group (nutrition study group 1–3), days (the original measurement-day label, e.g. WD1), weight (the rat’s weight in grams) and Time (the measurement day as an integer).
Let’s first try a plot of the weight against the time, differentiating the different nutrition study groups by color.
# Plot the RATSL data
ggplot(RATSL, aes(x = Time, y = weight, linetype = ID)) +
geom_line(aes(color=ID)) +
scale_linetype_manual(values = 1:16) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(limits = c(min(RATSL$weight), max(RATSL$weight)))
It is clearly visible that the rats belonging to nutrition study group 1 have the lowest starting weights. The rats in group 2 seem to show the steepest rise in weight. This phenomenon - tracking individual observations - might become more visible when plotting standardized data.
Standardization To standardize the values of each observation, we subtract the relevant occasion mean from the original observation and divide it by the corresponding standard deviation: \[standardised(x) = \frac{x- mean(x)}{sd(x)} \] The standardized values will be added to a new column stdweight into our RATSL dataset. We can then explore a plot with the now standardized data.
# Standardise the variable weight
RATSL <- RATSL %>% group_by(Time) %>% mutate(stdweight = (weight - mean(weight))/sd(weight)) %>% ungroup()
# Glimpse at the RATSL data with the new column stdweight
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1,…
## $ days <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD…
## $ weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555,…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8,…
## $ stdweight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001,…
# Plot again with the standardised weight
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
geom_line(aes(color=ID)) +
scale_linetype_manual(values = 1:16) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized weight")
The mutate() call above does exactly the same as applying scale() to weight within each Time group, but the explicit formula from above is used here to show the computation clearly.
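A quick sanity check of that equivalence (a sketch, reusing the RATSL object with the stdweight column created above; the name `std_scale` is hypothetical):

```r
# Sanity check (sketch): the hand-written formula should agree with scale()
# applied within each Time group.
library(dplyr)
check <- RATSL %>%
  group_by(Time) %>%
  mutate(std_scale = as.numeric(scale(weight))) %>%
  ungroup()
all.equal(check$stdweight, check$std_scale)  # should return TRUE
```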
In contrast to the non-standardized plot, the rising trend of the individual lines is no longer so clear. Some lines even show a negative slope, others are almost horizontal. The observations of group 1 are quite stable.
Summary Graphs
The differences among the groups can be explored further by employing summary graphs.
This is achieved by plotting, for each of the three nutrition study groups, the mean weight at every measurement time and adding the standard error of the mean as error bars: \[ se = \frac{sd(x)}{\sqrt{n}} \]
First some preparations:
# Number of measurement times (the baseline measurement at day 1 included)
n <- RATSL$Time %>% unique() %>% length()
# Summary data with mean and standard error of weight by Group and Time
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(weight), se = sd(weight)/sqrt(n) ) %>%
ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36,…
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, …
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.3…
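Note that n above is the number of measurement times (11), following the course material. A more conventional standard error of a group mean would use the number of rats per group instead; a hedged alternative sketch (the object name `RATSS_alt` is hypothetical):

```r
# Alternative sketch: standard error based on the number of animals per
# Group/Time cell, obtained with n() inside summarise().
library(dplyr)
RATSS_alt <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(weight), se = sd(weight) / sqrt(n()), .groups = "drop")
glimpse(RATSS_alt)
```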
As already suggested by the line plots above, the summary graph below shows a rising mean weight for all three groups and a huge difference in weight between group 1 and the two other groups.
# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line(aes(color=Group)) +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1", color=Group), width=0.3) +
theme(legend.position = "top") +
scale_y_continuous(name = "mean(weight) +/- se(weight)")
Box Plots
Box plots can be another alternative for visualisation of the data.
ggplot(RATSL, aes(x = factor(Time), y = weight, fill = Group)) +
geom_boxplot(position = position_dodge(width = 0.9)) +
theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
theme(legend.position = "top") +
scale_x_discrete(name = "Time")
In this box plot diagram we can see the outliers in the different groups at every measurement time. Still, it is questionable whether this visualisation makes much sense, as there are quite few individuals per group (8 in group 1, 4 in group 2 and 4 in group 3). One alternative (in order to pool more data points per box) is to box-plot the mean weight over time for each group.
RATSLSUM <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(weight) ) %>%
ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
glimpse(RATSLSUM)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, …
ggplot(RATSLSUM, aes(x = Group, y = mean)) +
geom_boxplot() +
theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
scale_y_continuous(name = "mean(weight), time > 1 day")
As expected from the previous plot, we can still see the outliers in the three groups. Next, let’s remove them from the box plot, starting with the most obvious outlier in group 2, which can be rejected by cutting on the mean weight value.
RATSLSUM1 <- RATSLSUM %>% filter(mean < 570)
glimpse(RATSLSUM1)
## Rows: 15
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, …
ggplot(RATSLSUM1, aes(x = Group, y = mean)) +
geom_boxplot() +
theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
scale_y_continuous(name = "mean(weight), time > 1 day")
Now let’s get rid of all three visible outliers. We can identify the individuals to filter out from the summary plots and the corresponding legend: the most obvious outlier is #12 from group 2, followed by #13 from group 3 and #2 from group 1.
RATSLSUM1 <- RATSLSUM %>% filter(ID != 2 & ID != 12 & ID != 13)
glimpse(RATSLSUM1)
## Rows: 13
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3
## $ ID <fct> 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16
## $ mean <dbl> 263.2, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 457.5, …
ggplot(RATSLSUM1, aes(x = Group, y = mean)) +
geom_boxplot() +
theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
scale_y_continuous(name = "mean(weight), time > 1 day")
Without the outliers, the boxplots are much tighter and less skewed (there is little variation within the groups, but quite some between them). Then again, the sample size is quite limited, so one could question whether this visualisation makes sense: removing three individuals from a set of 16 (and, for groups 2 and 3, one out of four) is already quite a fraction.
T-test
To get more statistical information on the differences between the groups, one can perform t-tests. As a t-test compares only two groups, I will first create subsets of the data to compare the groups pairwise.
The t-tests confirm – through very low p-values – the group differences.
# CUTS ON THE GROUPS (outliers excluded)
RATSCUT23 <- RATSLSUM1 %>% filter(Group != 1)
RATSCUT13 <- RATSLSUM1 %>% filter(Group != 2)
RATSCUT12 <- RATSLSUM1 %>% filter(Group != 3)
t.test(mean ~ Group, data = RATSCUT23, var.equal = TRUE) # T-test on Group 2 and Group 3
##
## Two Sample t-test
##
## data: mean by Group
## t = -44.305, df = 8, p-value = 7.434e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -193.2012 -174.0845
## sample estimates:
## mean in group 1 mean in group 2
## 268.7571 452.4000
t.test(mean ~ Group, data = RATSCUT13, var.equal = TRUE) # T-test on Group 1 and Group 3
##
## Two Sample t-test
##
## data: mean by Group
## t = -77.715, df = 8, p-value = 8.377e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -277.5065 -261.5125
## sample estimates:
## mean in group 1 mean in group 3
## 268.7571 538.2667
t.test(mean ~ Group, data = RATSCUT12, var.equal = TRUE) # T-test on Group 1 and Group 2
##
## Two Sample t-test
##
## data: mean by Group
## t = -44.305, df = 8, p-value = 7.434e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -193.2012 -174.0845
## sample estimates:
## mean in group 1 mean in group 2
## 268.7571 452.4000
To check whether the significance changed much by removing the outliers, let’s repeat the tests with the outliers included.
# CUTS ON THE GROUPS (outliers included)
RATS23 <- RATSLSUM %>% filter(Group != 1)
RATS13 <- RATSLSUM %>% filter(Group != 2)
RATS12 <- RATSLSUM %>% filter(Group != 3)
t.test(mean ~ Group, data = RATS23, var.equal = TRUE) # T-test on Group 2 and Group 3
##
## Two Sample t-test
##
## data: mean by Group
## t = -9.0646, df = 10, p-value = 3.88e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -277.5345 -168.0155
## sample estimates:
## mean in group 1 mean in group 2
## 265.025 487.800
t.test(mean ~ Group, data = RATS13, var.equal = TRUE) # T-test on Group 1 and Group 3
##
## Two Sample t-test
##
## data: mean by Group
## t = -27.824, df = 10, p-value = 8.345e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -283.4943 -241.4557
## sample estimates:
## mean in group 1 mean in group 3
## 265.025 527.500
t.test(mean ~ Group, data = RATS12, var.equal = TRUE)# T-test on Group 1 and Group 2
##
## Two Sample t-test
##
## data: mean by Group
## t = -9.0646, df = 10, p-value = 3.88e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -277.5345 -168.0155
## sample estimates:
## mean in group 1 mean in group 2
## 265.025 487.800
Still, we get quite low p-values, although they are now some orders of magnitude higher. Especially the comparison of group 1 to group 3 gives a very small p-value.
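As a side note, when var.equal is not set, t.test() defaults to the Welch variant, which does not assume equal group variances; with the outliers included this may be the safer choice. A sketch, reusing the subset defined above:

```r
# Welch two-sample t-test (unequal variances allowed), Group 1 vs Group 3
t.test(mean ~ Group, data = RATS13)
```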
Anova test
Another test that can be performed on the dataset is an analysis of variance (ANOVA). For this, we add the baseline measurement of the outcome variable, so that we can adjust for it in the model. We define the WD1 values from the original RATS dataset as the baseline.
# Add the baseline from the original data as a new variable to the summary data
RATSLSUM2 <- RATSLSUM %>% mutate(baseline = RATS$WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATSLSUM2)
summary(fit)
##
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSLSUM2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.905 -4.194 2.190 7.577 14.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.16375 21.87657 1.516 0.1554
## baseline 0.92513 0.08572 10.793 1.56e-07 ***
## Group2 34.85753 18.82308 1.852 0.0888 .
## Group3 23.67526 23.25324 1.018 0.3287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.68 on 12 degrees of freedom
## Multiple R-squared: 0.9936, Adjusted R-squared: 0.992
## F-statistic: 622.1 on 3 and 12 DF, p-value: 1.989e-13
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One can see that the baseline weight is strongly related to the weight values at the later times, whereas the group variable has a higher p-value (about 0.076) and is thus not significant at the 5% level. The weight gain therefore seems to depend more on the starting weight than on the provided diet.
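As a quick, optional check of the assumptions behind this ANOVA, one can look at the standard diagnostic plots of the fitted linear model (a sketch, reusing the fit object from above):

```r
# Residuals vs fitted values and normal Q-Q plot for the baseline-adjusted model
par(mfrow = c(1, 2))
plot(fit, which = 1:2)
par(mfrow = c(1, 1))
```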
We have wrangled the original wide BPRS dataset from here into a long version and saved it in the data folder (the data wrangling is documented in the script meet_and_repeat.R in the data folder). The dataset follows a psychiatric rating score (BPRS, brief psychiatric rating scale) for 40 individuals over a timespan of several weeks. The individuals receive one of two different medications and thus belong to treatment group 1 or 2.
The variables are as follows: treatment (treatment group 1 or 2), subject (identifier of the individual, numbered 1–20 within each treatment group), weeks (the original week label, e.g. week0), bprs (the BPRS score) and week (the week as an integer, 0–8).
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T) # Original BPRS data
BPRSL <- read.csv("./data/bprsl.csv") %>% as.data.frame() # Read in the long version of the BPRS dataset
BPRSL$treatment <- factor(BPRSL$treatment) # factor the categorical values
BPRSL$subject <- factor(BPRSL$subject)
As it was for the RATS and RATSL dataset, we can see the differences between wide and long form by glimpsing at the data and looking at it with head() and tail().
glimpse(BPRS)
## Rows: 40
## Columns: 11
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ subject <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ week0 <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
## $ week1 <int> 36, 68, 55, 77, 75, 43, 61, 36, 43, 51, 34, 52, 32, 35, 68,…
## $ week2 <int> 36, 61, 41, 49, 72, 41, 47, 38, 39, 51, 34, 49, 36, 36, 65,…
## $ week3 <int> 43, 55, 38, 54, 65, 38, 30, 38, 35, 55, 41, 54, 31, 34, 49,…
## $ week4 <int> 41, 43, 43, 56, 50, 36, 27, 31, 28, 53, 36, 48, 25, 25, 36,…
## $ week5 <int> 40, 34, 28, 50, 39, 29, 40, 26, 22, 43, 36, 43, 25, 27, 32,…
## $ week6 <int> 38, 28, 29, 47, 32, 33, 30, 26, 20, 43, 38, 37, 21, 25, 27,…
## $ week7 <int> 47, 28, 25, 42, 38, 27, 31, 25, 23, 39, 36, 36, 19, 26, 30,…
## $ week8 <int> 51, 28, 24, 46, 32, 25, 31, 24, 21, 32, 36, 31, 22, 26, 37,…
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
head(BPRS)
## treatment subject week0 week1 week2 week3 week4 week5 week6 week7 week8
## 1 1 1 42 36 36 43 41 40 38 47 51
## 2 1 2 58 68 61 55 43 34 28 28 28
## 3 1 3 54 55 41 38 43 28 29 25 24
## 4 1 4 55 77 49 54 56 50 47 42 46
## 5 1 5 72 75 72 65 50 39 32 38 32
## 6 1 6 48 43 41 38 36 29 33 27 25
head(BPRSL); tail(BPRSL)
## treatment subject weeks bprs week
## 1 1 1 week0 42 0
## 2 1 2 week0 58 0
## 3 1 3 week0 54 0
## 4 1 4 week0 55 0
## 5 1 5 week0 72 0
## 6 1 6 week0 48 0
## treatment subject weeks bprs week
## 355 2 15 week8 37 8
## 356 2 16 week8 22 8
## 357 2 17 week8 43 8
## 358 2 18 week8 30 8
## 359 2 19 week8 21 8
## 360 2 20 week8 23 8
The different week variables are now gathered into a single weeks variable, and another column with the week as an integer is appended.
Let’s try looking at the data with a bprs against week plot, numbering the different subjects.
# Check the dimensions of the data
dim(BPRSL)
## [1] 360 5
# Plot the BPRSL data
ggplot(BPRSL, aes(x= week, y = bprs, group = treatment)) +
geom_text(aes(label = subject, color=treatment)) +
scale_x_continuous(name = "Week", breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "bprs") +
theme_bw() +
theme(legend.position = "top") +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
As we can see in the above plot, each subject ID appears twice, because the numbers 1-20 were given to individuals in both treatment group 1 and group 2. It thus makes sense to differentiate by treatment group. The same plot can also be drawn without the subject IDs, to see whether group-specific patterns emerge.
ggplot(BPRSL, aes(x= week, y = bprs, group = treatment)) +
geom_point(aes(color = treatment)) +
scale_x_continuous(name = "Week" , breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "bprs") +
theme_bw() +
theme(legend.position = "top") +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
The plot below makes it easier to track the different subjects and to separate the groups. It is a bit less “chaotic” this way.
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line(aes(color=subject)) +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
As the visualisations are still quite chaotic, we will next fit a linear regression model, using bprs as the response and week and treatment as explanatory variables.
# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
There is a highly significant linear relationship between week and bprs. However, there is no significant difference between treatment groups 1 and 2.
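The same conclusion can be read off the confidence intervals of the coefficients (a sketch, reusing the BPRS_reg model from above; the interval for treatment2 straddles zero, consistent with its large p-value):

```r
# 95% confidence intervals for the naive linear model's coefficients
confint(BPRS_reg)
```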
The model above still ignores the repeated-measures structure and treats every observation of bprs as independent - which is probably far from the truth. So we will now look at more advanced models. The random intercept model allows each subject to have its own intercept in the linear regression fit. In the first random intercept model we again use week and treatment as explanatory variables.
For the fit, the lmer() function from the lme4 package can be used: it takes the model formula as its first argument, where in addition to the fixed effects the random effects are specified in parentheses, separated from the grouping factor by | (with 1 referring to the random intercept).
# access library lme4
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
If we now also include week as a random effect, the individual subjects will also be allowed to differ in slope, so that we additionally account for differences over time.
# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
One can compare the two models with a likelihood ratio test using anova() again.
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing the two fits via the \(\chi^2\) statistic (7.27, p ≈ 0.026), we can see that BPRS_ref1 seems to be better than BPRS_ref.
# create a random intercept and random slope model with the interaction
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again looking at the \(\chi^2\) statistic, BPRS_ref2 performs slightly better than BPRS_ref1 (p ≈ 0.075), so let’s take this model for another visualisation. We save the fitted values of this model in a vector, add it to our BPRSL data as another column and then repeat the plot above with the fitted data.
# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref2)
# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>% mutate(fitted = Fitted)
# draw the plot of BPRSL with the fitted values of weight
ggplot(BPRSL, aes(x = week, y = fitted, color = subject, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times = 2)) +
facet_grid(. ~ treatment, labeller = label_both)
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line(aes(color=subject)) +
scale_linetype_manual(values = rep(1:10, times=2)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
Comparing with the original plot, one can see that the model is not perfect, but it captures the general trends.
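One simple way to put a number on this (a sketch, reusing the fitted column added above; `rmse` is a hypothetical name) is the root-mean-square error between observed and fitted values:

```r
# RMSE between observed bprs and the fitted values of BPRS_ref2
rmse <- sqrt(mean((BPRSL$bprs - BPRSL$fitted)^2))
rmse
```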